A semantic feature enhanced YOLOv5-based network for polyp detection from colonoscopy images

Colorectal cancer (CRC) is a common digestive system tumor with high morbidity and mortality worldwide. Computer-assisted colonoscopy for polyp detection is now relatively mature, but it still faces challenges such as missed and false detections. Improving the polyp detection rate is therefore key to effective colonoscopy. To address this problem, this paper proposes an improved YOLOv5-based polyp detection method for colorectal cancer. A new structure called P-C3 is incorporated into the backbone and neck of the model to enhance feature expression. In addition, a contextual feature augmentation module is introduced at the bottom of the backbone network to enlarge the receptive field for multi-scale feature information, and a coordinate attention mechanism is used to focus on polyp features. The experimental results show that, compared with several existing target detection algorithms, the proposed model offers clear advantages in polyp detection accuracy, especially in recall, which largely alleviates the problem of missed polyps. This study will help improve the polyp/adenoma detection rate of endoscopists during colonoscopy and is also of significance for clinical work.


Traditional polyp detection model
Endoscopic images are mainly analyzed by endoscopists, so the accuracy of the assessment depends heavily on the skill of the physician. Accurate analysis of endoscopic images is difficult for inexperienced endoscopists. A real-time system capable of automatic polyp detection could help physicians improve their ability to analyze endoscopic images. Computer-aided diagnosis (CAD) for colonoscopy has long been an active topic in artificial intelligence research. CAD can run in real time and prompt endoscopists to pay attention to polyps that may be overlooked, thereby improving the adenoma detection rate 16,17.
Traditional polyp detection algorithms usually use hand-crafted features such as texture, shape, and color to detect polyps. Krishnan et al. devised a method to obtain the location of polyps in images by extracting haustral folds in medical images 18. Kang et al. designed a fast detection system with the real-time requirements of practical polyp detection in mind. The system segments medical images and then classifies the segments separately to find the location of polyps 19. Bernal et al. used valley information to design a detection algorithm that determines intact polyp boundaries based on the properties of the polyp surface 20. Alexandre et al. devised a method that uses a support vector machine (SVM) to classify each segmented sub-image as containing a polyp or not 21. Li et al. proposed a method that represents polyp images with multiple patches of different sizes, each of which is classified by an SVM to determine whether it is a polyp 22. Manouchehri et al. designed a new network model for frame-level polyp detection based on Visual Geometry Group Net (VGGNet), and then segmented polyps using post-processing methods in a fully convolutional neural network 23. Mostafiz et al. extracted color features of polyps and designed an intelligent system for gastrointestinal polyp detection in endoscopic videos by fusing 2D empirical mode decomposition and deep neural network features 24. Billah et al. designed a new system to address the high miss rate of traditional algorithms; it automatically extracts color wavelet features from video frames to train a linear SVM for polyp classification 25. Hasan et al. designed a method for gastrointestinal polyp detection by fusing contour wavelet transform and neural features. The method uses minimum redundancy maximum relevance (MRMR) dimensionality reduction to extract deep features from the image and then an SVM to diagnose gastrointestinal polyps and label the detected polyp regions 26.
Traditional polyp detection models usually adopt a sliding-window strategy for region selection, but the windows vary in size and are not targeted, resulting in redundant windows and high time complexity. In addition, the feature extraction methods are not robust because of the diversity of polyp morphology, illumination, and background. For example, some polyps have a smooth surface and show the same shape and texture features as the normal lining in endoscopic images, so traditional algorithms tend to miss them; conversely, the normal inner wall of the colon has raised structures that traditional algorithms easily mistake for polyps. Traditional polyp detection algorithms therefore struggle to complete the detection task well.

Two-stage polyp detection model
In recent years, following the excellent performance of AlexNet on ImageNet, deep learning has excelled in computer vision, with performance far exceeding that of traditional algorithms, and the field of polyp detection is no exception. To find the optimal polyp detection algorithm, Bernal et al. investigated a large number of deep learning detection methods as well as the earlier detection methods based on manual feature extraction. The comparison shows that deep learning methods perform better for both single-image detection and medical video frame detection, and that they are more convenient than manual feature extraction in practical clinical applications 27.
In the design of two-stage detection models, convolutional neural networks (CNNs) have been applied to practical clinical polyp detection, and the miss rate has been kept low. Tajbakhsh et al. proposed a hybrid context-shape polyp detection method for computer-aided detection. The algorithm first removes non-polyp edges from the edge map using contextual information and then locates polyp candidate regions in the refined edge map using multi-scale temporal information based on color and texture 28. In the detection of colon polyps, convolutional neural network models are susceptible to small perturbations and noise, which can cause polyps to be missed in neighboring video frames, increasing the number of false negatives and decreasing the accuracy of the detection model. Qadir et al. designed a two-stage approach using a region-of-interest unit and a false-negative reduction unit. The algorithm provides an overall performance improvement in terms of sensitivity, accuracy and specificity 29.
Mo et al. studied and compared many colonoscopic polyp detection methods and found that traditional methods are difficult to apply in practice because of their lower accuracy and higher time complexity, while neural network-based detection models perform remarkably well. Among these, polyp detection based on the Faster R-CNN algorithm performs best and achieves satisfactory results that can be used in clinical practice 30. Qadir et al. analyzed the differences among polyps in contrast, size, and texture, used Mask R-CNN as a baseline model, replaced its feature extractor, and compared the improvement in detection and segmentation performance for each feature extractor. Finally, they proposed an ensemble method for polyp detection and segmentation with good results on the MICCAI dataset 31. Tashk et al. modified the region proposal network to localize polyps in video frames acquired from a colon capsule endoscope. This method can accurately detect polyps in the video stream and, in addition, provides a predictive score for their risk of malignancy 32. Patel et al. used several neural networks to classify polyps and compared their effectiveness on a publicly available polyp dataset. Ultimately, the VGG-19 model was found to perform better on the dataset than the residual network (ResNet), the Densely Connected Convolutional Network (DenseNet), Squeeze-and-Excitation Networks (SENet) and other models 33. Hasan et al. chose the optimal polyp detection strategy by combining different CNN architectures and feature extractors 34. In addition, Tang et al. utilized transfer learning for computer-aided colonoscopy polyp detection. By applying a pre-trained CNN model to a colonoscopy image dataset, the method can effectively identify and detect polyps in the colon. Compared with training the model from scratch, it converges faster and generalizes better 35.
Although two-stage models are the mainstream target detection method, their high time complexity hinders practical application in medicine. For example, when the R-CNN model extracts candidate boxes from a polyp image, about 2000 candidate regions are generated, and each of them must pass through the CNN for feature extraction and SVM classification. This creates many redundant operations and thus takes a lot of training time. In addition, because the fully connected layer in the CNN requires a fixed-size input, the preprocessing operations applied to each candidate region, such as cropping and warping, degrade the quality and content of the image.

Single-stage polyp detection model
Liew et al. designed a new network based on the ResNet-50 residual network that incorporates principal component analysis and AdaBoost ensemble learning. The method merges three publicly available datasets, CVC-ClinicDB, ETIS-LaribPolypDB, and Kvasir, to train the model and applies contrast enhancement, image thresholding, and median filtering to reduce the interference of noise 36. Influenced by the DETR detection model, Shen et al. designed an end-to-end Convolutional Transformer Network (COTR). To address the slow convergence of DETR, this network embeds a convolutional layer into the Transformer encoder to accelerate model convergence and feature reconstruction 37. Wang et al. designed the lightweight models VGGNets-GAP and ResNets-GAP by introducing global average pooling into two networks, VGGNets and ResNets, with the aim of improving accuracy and reducing model complexity 38. Qadir et al. introduced a two-dimensional Gaussian mask in a single feedforward network model to reduce the miss rate of polyps. Experiments show that this method performs especially well when the polyp boundary is hard to distinguish from the background 39. To address the challenges posed by heterogeneous polyp datasets, Li et al. used the high-resolution HRNet model as a baseline and proposed a low-rank module that achieves accurate polyp segmentation and enhances the generalization ability of the model 40.
The You Only Look Once (YOLO) series is among the most widely used families of target detection algorithms, and some scholars have proposed improved versions of YOLO to achieve more accurate recognition 42. Based on the YOLO network framework, Luo et al. designed a polyp detection system to meet the needs of the clinical environment, which has a high polyp detection rate (PDR), with an especially clear effect on small polyps 43. Guo et al. designed a new polyp detection algorithm based on YOLOv3 combined with an active learning approach; aimed at reducing detection time, it can also reduce the false-positive rate in the automatic detection of colon polyps 44. Cao et al. investigated the detection of gastric polyps of different sizes, with a special focus on small gastric polyps, and constructed a new feature extraction module and feature fusion module in the YOLOv3 model. This method combines the semantic information of high-level feature maps with low-level feature maps, which is effective for detecting small gastric polyps 45. Pacal et al. applied a cross-stage partial network to the entire architecture and redesigned the backbone structure to address the shortcomings of YOLOv4. The real-time performance and accuracy of this approach far exceeded those of the initial model on the publicly available dataset 46. Chou et al. first applied the discrete wavelet transform to extract and enhance the faint texture features in polyp images, then used a pattern-based generative adversarial network to augment the image data, and finally detected polyps with YOLOv4, which improved polyp detection 47.
Although single-stage polyp detection models have a certain advantage in speed, the colon contains many polyp-like structures with strong edges, including colonic folds, blood vessels, specular highlights, lumen areas, and air bubbles. This complex colonic environment causes the algorithms to produce too many false positives. Moreover, some algorithms are not effective when an image contains multiple polyps or small polyps.

Architecture of the proposed network
To reduce the missed and false detection rates of polyps when data are limited, this paper proposes a multi-attention colorectal cancer polyp detection model based on YOLOv5. The network consists of an image input stage, a backbone, a neck and a prediction head, and its structure is shown in Fig. 1.
In Fig. 1, K represents the size of the convolution kernel. For example, K = 5 means that the convolution kernel is 5 × 5.
In this model, the backbone network, New CSP-DarkNet53, uses Focus, C3, Spatial Pyramid Pooling-Fast (SPPF) and Context Feature Augmentation (CFA), with convolution-batch normalization-ReLU activation (CBL) as the basic convolution unit, to extract features from colorectal images. The Focus operation achieves high-quality downsampling by using a slice operation to split a high-resolution feature map into multiple low-resolution feature maps. SPPF is a spatial pyramid pooling layer designed to further increase the receptive field of the feature map. It enables polyps to be well detected when images are input at different scales and effectively avoids the image distortion caused by cropping and scaling colorectal images. SPPF stacks three identical max-pooling layers with a 5 × 5 kernel in series; the consecutive max pooling further increases the receptive field and solves the problem of repeated extraction of polyp feature information by the neural network. However, this direct fusion of information of different densities leads to semantic conflicts, limits the expression of multi-scale features, and easily causes the features of tiny polyps to be submerged in conflicting information. To enable tiny polyps to be detected, this paper designs a CFA module, which uses dilated convolution to extract contextual information from different receptive fields to enhance feature expression, and integrates it on top of the backbone network. A coordinate attention mechanism is connected to the path aggregation network (PAN) after the CFA; its purpose is to strengthen the channel connections between features and improve detection accuracy while preserving running speed.
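As an illustration only, the following minimal pure-Python sketch mimics two of the operations described above on a single-channel feature map: the Focus slice, and the serial 5 × 5 max pooling used by SPPF. The function names `focus_slice`, `max_pool2d` and `sppf_pools`, the stride of 1, and the border handling (windows clipped at the image edge) are assumptions made for this sketch and are not taken from the authors' implementation. It checks that two and three stacked 5 × 5 poolings cover the same neighborhood as a single 9 × 9 and 13 × 13 pooling, which is how the serial design enlarges the receptive field.

```python
import random


def focus_slice(grid):
    """Split one H x W map into four (H/2) x (W/2) maps by taking every second pixel."""
    offsets = ((0, 0), (1, 0), (0, 1), (1, 1))  # (row offset, column offset)
    return [[row[dx::2] for row in grid[dy::2]] for dy, dx in offsets]


def max_pool2d(grid, k):
    """k x k max pooling with stride 1; windows are clipped at the borders."""
    h, w, r = len(grid), len(grid[0]), k // 2
    return [
        [
            max(
                grid[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


def sppf_pools(grid, k=5):
    """Return the input and the outputs of three serial k x k poolings.

    In a real network these four maps would be concatenated along the channel axis.
    """
    p1 = max_pool2d(grid, k)
    p2 = max_pool2d(p1, k)
    p3 = max_pool2d(p2, k)
    return [grid, p1, p2, p3]


rng = random.Random(0)
feature = [[rng.random() for _ in range(16)] for _ in range(16)]
_, p1, p2, p3 = sppf_pools(feature)

# Two stacked 5 x 5 poolings act like one 9 x 9 pooling, three like one 13 x 13.
print(p2 == max_pool2d(feature, 9), p3 == max_pool2d(feature, 13))

# Focus slice on a 4 x 4 map yields four 2 x 2 maps.
small = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(focus_slice(small))
```

Reusing the previous pooling result is cheaper than running three independent poolings with kernels 5, 9 and 13, which is a plausible reason for the serial layout.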
The neck network of the model uses the PAN structure to fuse the feature information of polyps. The PAN introduces bottom-up pathways to gradually aggregate and integrate polyp features of different scales, enabling the network to provide a more comprehensive and richer representation of polyp features. First, the network performs top-down upsampling, so that the lower-level feature maps contain more semantic information about the image. Second, it performs bottom-up downsampling, so that the top layers of the network can express more accurate location information of polyps. The two sets of features are then fused, so that polyp feature information and location information are reflected in the feature maps of every size, ensuring accurate prediction of polyps. To ensure that the model extracts polyp feature information more accurately, this paper introduces coordinate attention into the PAN structure. Finally, feature information at three different scales is output to the prediction head to detect polyps of different sizes.

Input
Polyp image data are scarce, labeling them takes considerable manpower and time, and target detection algorithms need a large amount of high-quality data for model training. Therefore, in the input stage, this paper first adaptively scales the input image and uses the K-means++ clustering algorithm to automatically learn and adjust the anchor box sizes, so as to better predict the location of colorectal polyps. On this basis, the Mosaic data augmentation method is used to address the lack of data. This method randomly selects four images from the dataset, rotates, scales, and deforms them, and combines them into a new polyp image. The basic principle of Mosaic data augmentation is shown in Fig. 2.
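The core of the Mosaic step can be sketched as follows. This is a deliberately simplified, hypothetical version: it tiles four equally sized images in a fixed 2 × 2 layout and shifts their bounding boxes accordingly, and it omits the random selection, rotation, scaling, and deformation described above. The function name `mosaic` and the `(x1, y1, x2, y2)` box format are assumptions for illustration.

```python
def mosaic(tiles, boxes):
    """Combine four H x W images into one 2H x 2W image and remap their boxes.

    tiles: list of four images, each a list of H rows with W pixel values.
    boxes: list of four lists of (x1, y1, x2, y2) boxes, one list per tile.
    """
    h, w = len(tiles[0]), len(tiles[0][0])
    # Place tiles 0 and 1 side by side on top, tiles 2 and 3 below.
    top = [left + right for left, right in zip(tiles[0], tiles[1])]
    bottom = [left + right for left, right in zip(tiles[2], tiles[3])]
    canvas = top + bottom
    # Each tile's boxes move by the offset of that tile on the canvas.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    merged = [
        (x1 + ox, y1 + oy, x2 + ox, y2 + oy)
        for (ox, oy), tile_boxes in zip(offsets, boxes)
        for (x1, y1, x2, y2) in tile_boxes
    ]
    return canvas, merged


# Four 2 x 2 toy images filled with their own index, and a few toy boxes.
tiles = [[[k] * 2 for _ in range(2)] for k in range(4)]
boxes = [[(0, 0, 1, 1)], [], [(0, 1, 2, 2)], [(1, 1, 2, 2)]]
canvas, merged = mosaic(tiles, boxes)
print(canvas)
print(merged)
```

Because each composite image carries the annotated polyps of up to four source images, one training sample exposes the detector to more targets and more varied backgrounds than a single original image.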

P-C3
To further improve the feature representation of the model, a new structure called P-BottleNeck is designed in this paper. The details of P-BottleNeck are shown in Fig. 3. The two Conv units are connected in parallel with an extra shortcut to prevent the loss of polyp features. An add operation with an optional residual link is then performed after a 3 × 3 convolution. In Fig. 3, k1 and k3 represent convolution kernels of sizes 1 × 1 and 3 × 3, respectively, s1 represents a convolution with a stride of 1, p0 represents a padding of 0 in the convolution, and c represents the number of channels of the convolution.
A new type of cross-stage partial network is designed using the P-BottleNeck structure and named P-C3. The details of the P-C3 module are shown in Fig. 4: the input features pass through a convolution layer into the P-BottleNeck structure and are combined with an additional convolution branch to achieve a richer combination of gradients. Finally, the output features of the module are obtained after a 1 × 1 convolution.
We introduce the P-C3 module into the backbone and neck of the model. In the backbone network, we use the P-C3 module with the residual structure, which enhances feature extraction and mitigates the vanishing gradient problem. The P-C3 structure deepens the network and enlarges the receptive field, which enables the model to extract richer polyp feature information and enhances its feature expression ability.

CFA module
During a colonoscopy, polyps are difficult to detect because of their small size. The limitations of the network and the imbalance of the training dataset are the main reasons for the poor performance of tiny object detection. Therefore, this paper designs a contextual feature augmentation (CFA) module, which uses dilated convolution to extract contextual information from different receptive fields and fuses it at the top of the backbone network to enhance the contextual feature information of tiny polyps. Dilated convolutions with different dilation rates are used to obtain contextual information from different receptive fields and thereby enrich the contextual information of the PAN. The structure of the module is shown in Fig. 5.
The CFA module includes four parallel context reasoning branches, so that polyps can be discovered from contexts of different sizes. The first branch contains one 3 × 3 dilated convolution with a dilation rate of 1, and the second branch contains one 3 × 3 dilated convolution with a dilation rate of 2. These two branches capture local context information. The third branch sequentially stacks two 3 × 3 dilated convolutions with dilation rates of 2 and 4, and the fourth branch sequentially stacks two 3 × 3 dilated convolutions with dilation rates of 3 and 6; their larger dilation rates capture larger contexts. Each branch then reduces its number of channels with a 1 × 1 convolution with a dilation rate of 1, and the four reduced feature maps are concatenated along the channel dimension. Finally, one 1 × 1 convolution with a dilation rate of 1 fuses the concatenated polyp features from the different receptive fields and outputs the final context-enhanced feature map.
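To make the effect of the four branches concrete, the short sketch below computes the receptive field of each branch from its dilation rates. It assumes stride-1 convolutions, for which a k × k convolution with dilation rate d enlarges the receptive field by (k − 1) · d; the helper name `receptive_field` and the branch labels are illustrative only.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (in pixels per side) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf


# Dilation rates of the 3 x 3 convolutions in each CFA branch, as listed in the text.
branches = {
    "branch 1": [1],
    "branch 2": [2],
    "branch 3": [2, 4],
    "branch 4": [3, 6],
}

fields = {name: receptive_field(rates) for name, rates in branches.items()}
for name, rf in fields.items():
    print(f"{name}: {rf} x {rf}")
```

Under these assumptions the branches see 3 × 3, 5 × 5, 13 × 13 and 19 × 19 neighborhoods, so the concatenated output mixes local detail with progressively wider context.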

Coordinate attention mechanism
The attention mechanism enables the model to focus on polyp feature information and to suppress non-critical feature information by assigning it low weights, so that the model extracts more accurate semantic information about polyps. Current mainstream attention mechanisms include Squeeze-and-Excitation attention (SE) and the Convolutional Block Attention Module (CBAM). SE enhances the critical information in the feature map by learning the importance of global channels. However, SE only encodes inter-channel information and ignores the importance of polyp location information. CBAM addresses this shortcoming by combining channel attention and spatial attention, learning the importance of each spatial location through the spatial attention mechanism. However, its high computational complexity makes it difficult to apply to real-time polyp detection.
To locate and identify polyps more accurately, and to improve detection accuracy while preserving inference speed, we introduce a simple and flexible coordinate attention mechanism (CAM) that pays special attention to the important regions of the image. The specific process of this attention mechanism is shown in Fig. 6.
The CAM not only captures polyp feature information across channels and strengthens the channel connections among features, but also captures direction-aware and position-aware information, which helps the model detect polyps accurately and localize them precisely. In addition, the CAM is flexible and lightweight and can be applied to real-time polyp detection tasks.
To avoid compressing all spatial information into the channel dimension, which would make it impossible to capture long-range spatial interactions with precise location information, the coordinate attention mechanism decomposes global average pooling over the spatial dimensions into two directions, height and width, and obtains two direction-aware feature maps of sizes C×H×1 and C×1×W:

$$Z_c^h(h) = \frac{1}{W} \sum_{0 \le j < W} x_c(h, j)$$

$$Z_c^w(w) = \frac{1}{H} \sum_{0 \le i < H} x_c(i, w)$$

where x represents the feature map; H, W, and C represent its height, width, and number of channels; and h, w, and c index a row, a column, and a channel. Z_c^h and Z_c^w represent the perceptual attention maps obtained by feature aggregation along the two spatial dimensions of height and width, respectively. The indices i and j represent positions in the feature map along the height and the width.
Next, the C×1×W feature map, which has a global receptive field along the width dimension, is transposed to C×W×1 and concatenated with the C×H×1 feature map along the spatial dimension. A shared convolution module then reduces the channel dimension to 1/r of the original to obtain the feature map F_1. After batch normalization, F_1 is activated by the sigmoid function to obtain the feature map f ∈ R^{C/r×(H+W)×1}:

$$F_1 = \mathrm{Conv}\left(\left[Z^h, Z^w\right]\right), \qquad f = \delta\left(\mathrm{BN}(F_1)\right)$$

where Z^h and Z^w represent the feature maps in the height and width dimensions, [·, ·] denotes concatenation, BN denotes batch normalization, and δ represents the sigmoid activation function.
Then, the feature map f is split along the spatial dimension into the feature maps f^h ∈ R^{C/r×H×1} and f^w ∈ R^{C/r×W×1}. Each is restored to the number of channels of the original feature map and sigmoid activated to obtain the attention weights g^h ∈ R^{C×H×1} in the height direction and g^w ∈ R^{C×W×1} in the width direction:

$$g^h = \sigma\left(F_h(f^h)\right), \qquad g^w = \sigma\left(F_w(f^w)\right)$$

where F_h and F_w denote the convolutions that restore the number of channels and σ is the sigmoid function. Finally, the attention weights g^h and g^w are multiplied onto the original feature map to output the polyp feature map with attention weights:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
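The pooling, concatenation, channel reduction, and reweighting steps described above can be traced in a small pure-Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: feature maps are nested lists, the channel-reducing and channel-restoring transforms are modeled as 1 × 1 convolutions with hypothetical weight matrices `w1`, `wh` and `ww`, and batch normalization is omitted.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv1x1(weight, feat):
    """Apply a 1 x 1 convolution: weight is out x in, feat is in x positions."""
    n = len(feat[0])
    return [[sum(wk * ch[p] for wk, ch in zip(row, feat)) for p in range(n)]
            for row in weight]


def coordinate_attention(x, w1, wh, ww):
    """x is C x H x W; w1 is (C/r) x C; wh and ww are C x (C/r)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Directional average pooling: C x H (over the width) and C x W (over the height).
    z_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    z_w = [[sum(x[c][i][j] for i in range(H)) / H for j in range(W)] for c in range(C)]
    # Concatenate along the spatial axis, reduce channels, activate (BN omitted).
    z = [z_h[c] + z_w[c] for c in range(C)]
    f = [[sigmoid(v) for v in row] for row in conv1x1(w1, z)]
    # Split back into height and width parts and restore the channel count.
    f_h = [row[:H] for row in f]
    f_w = [row[H:] for row in f]
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(wh, f_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(ww, f_w)]
    # Reweight every position by its row weight and its column weight.
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


rng = random.Random(0)
C, H, W, r = 4, 3, 5, 2
x = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
wh = [[rng.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
ww = [[rng.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
y = coordinate_attention(x, w1, wh, ww)
print(len(y), len(y[0]), len(y[0][0]))  # output keeps the C x H x W shape
```

Because each output value is the input scaled by one weight for its row and one for its column, the module can emphasize a polyp's position along both axes while adding only a few small convolutions.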

Loss function
The loss function of the polyp detection model used in this paper includes classification loss, regression loss and confidence loss. It can be described as follows:

$$L = L_{cls} + L_{boxes} + L_{obj}$$

where L_{cls} stands for classification loss, L_{boxes} stands for regression loss, and L_{obj} stands for confidence loss. The regression loss of the bounding box is calculated as:

$$L_{boxes} = \lambda_{coord} \sum_{i} \sum_{j} I_{i,j} \left( 1 - \mathrm{IoU}(B, B^g) + \frac{d^2}{c^2} + \alpha v \right)$$

where λ_{coord} represents the regression loss coefficient of the bounding box, I_{i,j} indicates whether the j-th anchor in the i-th cell contains the target polyp, B represents the prediction box, B^g represents the true box, and IoU(B, B^g) is their intersection over union. c represents the diagonal length of the smallest rectangle that encloses both the prediction box and the true box, and d represents the Euclidean distance between the centroids of the true and prediction boxes. The parameter α is a positive weight, and v measures the consistency of the aspect ratio.
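The quantities named above (the overlap of the two boxes, d, c, α and v) correspond to those of the Complete-IoU (CIoU) loss. Assuming that this is the per-box regression term, a minimal sketch for a single pair of boxes is given below. The function name `ciou_loss`, the `(x1, y1, x2, y2)` box format, the small constant `eps`, and the standard CIoU definitions of v and α are assumptions of this sketch rather than details stated in the text.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss between a predicted box and a true box, both (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    # Intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)
    # d^2: squared distance between box centers.
    d2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # c^2: squared diagonal of the smallest rectangle enclosing both boxes.
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2
    # v: aspect-ratio consistency; alpha: its positive trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1)) - math.atan((x2 - x1) / (y2 - y1))
    ) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + d2 / (c2 + eps) + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes: loss close to 0
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: loss above 1
```

Unlike a plain IoU loss, the distance term still provides a useful signal when the predicted and true boxes do not overlap, as in the second call. The sketch assumes boxes with positive width and height.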

Ethical statements
We confirm that all methods in this paper were carried out in accordance with relevant guidelines and regulations, and all experimental protocols were approved by the Ethics Committee of Huai'an Second People's Hospital. We confirm that informed consent was obtained from all subjects and/or their legal guardian(s).

Dataset and implementation
To validate the proposed method, we collected 1200 colorectal images, each containing 1-4 colon polyps, from the endoscopy center of a local hospital, constructed the WCYZ dataset, and divided it into a training set, a test set and a validation set in the ratio of 8:1:1. Figure 7a shows some example polyp images, and Fig. 7b illustrates the ground-truth annotations as bounding boxes. Stochastic gradient descent (SGD) was chosen as the optimizer for all experiments, with an initial learning rate of 0.01, a momentum of 0.9, a batch size of 16, and 200 training epochs.
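For reference, an 8:1:1 split of 1200 images can be produced as sketched below. The shuffling, the fixed random seed, and the helper name `split_indices` are assumptions for illustration; the text does not state how the images were assigned to the three subsets.

```python
import random


def split_indices(n, ratios=(8, 1, 1), seed=0):
    """Shuffle n image indices and split them into train, test and validation sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_test = n * ratios[1] // total
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]
    return train, test, val


train, test, val = split_indices(1200)
print(len(train), len(test), len(val))
```

With 1200 images this gives 960 training, 120 test and 120 validation images, with no image shared between subsets.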

Evaluation metrics
In this paper, precision, recall and F-score are used to evaluate the detection performance of the model. Precision and recall are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN represent the number of true positives, false positives, and false negatives, respectively, that is, the number of polyps that were correctly detected and labeled, the number of falsely detected polyps, and the number of undetected polyps. The F-score provides an overall evaluation by jointly considering precision and recall. It is defined as follows:

$$F\text{-}\mathrm{score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
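These metrics can be computed from the three counts as in the following sketch. It assumes that the F-score is the balanced harmonic mean of precision and recall (often written F1), and the counts in the example call are made up for illustration rather than taken from the experiments.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, f_score) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 90 polyps found, 10 false alarms, 10 polyps missed.
precision, recall, f_score = detection_metrics(tp=90, fp=10, fn=10)
print(precision, recall, f_score)
```

Recall is the metric most directly tied to missed polyps: every undetected polyp increases FN and lowers recall, while false alarms only affect precision.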

Ablation experiments
To evaluate the contribution of the P-C3 module, the contextual feature augmentation module and the coordinate attention mechanism to the detection capability of the network, this paper conducts ablation experiments using YOLOv5 as the baseline. Table 1 lists the experimental results on the WCYZ dataset, showing the performance of the model after adding the P-C3 module, the contextual feature augmentation module, and the attention mechanism.
As can be seen from Table 1, the introduction of the P-C3 module allows the network to extract richer polyp features and achieve higher detection accuracy. The CFA module contributes greatly to the improvement of the detection rate because the dataset contains polyps of different sizes. The CFA module uses dilated convolution to extract context information from different receptive fields and integrates it into the PAN to enrich the context feature information of tiny polyps, enabling the model to perform better on small polyps and thereby improving its recall.

Visualization result analysis
In this paper, heat maps were used to visualize the results of polyp detection. By observing the heat distribution in the heat map, the model's ability to detect polyps and the accuracy of polyp localization can be visually assessed. The details of the visualization are shown in Fig. 8.
Comparing the polyp heat maps of the original model in Fig. 8b with those of the improved model in Fig. 8c shows that the improved network focuses on the polyp target region more accurately and with better coverage, which indicates that our method helps the deep convolutional network extract more critical polyp feature information.

Comparison of experimental results with other methods
In colorectal cancer research, the parameters or rules of the detection model are usually tuned to prioritize recall, so that more potentially abnormal polyps are found and the risk of missed detection is reduced. To verify the effectiveness of the proposed method, we used the popular deep learning-based detection algorithms R-CNN 48, Faster R-CNN 49, YOLOv4 50, YOLOv7 51, YOLOv8 52 and RT-DETR-R50 53 to conduct comparative experiments on detection accuracy and speed. The quantitative results of each detection model on our test set are shown in Table 2.
The comparative data in Table 2 show that our method has clear advantages in the detection rate and speed of polyp detection. Compared with the two-stage R-CNN model, recall increases by 2.7 percentage points and the detection time is greatly reduced. Compared with the single-stage YOLOv8 detection algorithm, recall increases by 4.5 percentage points, reaching 92.3%.

Test results visualization
To visualize the test results of polyp detection, Figs. 9, 10 and 11 show the detection results of some polyps in the test set.
Polyps that have a similar color to the background are difficult to detect and to identify accurately. The detection results for images with weak contrast between polyps and background are shown in Fig. 9. The figure shows that our model handles polyps with weak contrast effectively.
Smaller polyps in colon images are difficult to detect. The results of the improved model on smaller polyps are shown in Fig. 10. The figure shows that the proposed detection algorithm can accurately identify and localize them. This is due to the introduction of the CFA module, which enables the model to handle tiny polyps well.
In addition, an image may contain multiple polyps, and the proposed method also handles this situation well. The corresponding detection results are shown in Fig. 11. The figure shows that the proposed method performs well in detecting multiple polyps of different sizes.

Comparison of polyp detection results
To verify the effectiveness of our model in real detection, we compare its detection results with those of the YOLOv5, YOLOv7 and YOLOv8 models. We used four sets of scene images to qualitatively evaluate the detection performance of the models, and the polyp detection results are shown in Fig. 12.
In the first set of images, YOLOv5 faces challenges in accuracy, YOLOv7 and YOLOv8 perform similarly, and our model performs well. In the second set of images, because of the weak contrast between the polyps and the image background, the YOLOv5 model misses a detection, while the proposed model accurately detects polyps with an accuracy of 93%, which shows that the improved model can efficiently handle targets with weak contrast. The YOLOv7 and YOLOv8 algorithms perform better than YOLOv5 but still worse than our proposed model. In the third set of images, which contains tiny polyps, the YOLOv5, YOLOv7, and YOLOv8 models are not effective in detecting them. In contrast, the improved model accurately captures the feature information of the tiny polyps and localizes them precisely. In the fourth set of images, the YOLOv7 model fails to detect polyps and the YOLOv5 and YOLOv8 models localize polyps inaccurately, whereas our method localizes the polyps accurately.
Overall, our improved model can achieve excellent performance in the polyp detection process.

Conclusion
The detection of polyps in colonoscopy images is an important part of medical image recognition. Polyps are difficult to detect because of their weak contrast in colonoscopy images and the small size of some polyps. This paper proposes a more refined polyp detection method built on the YOLOv5 network, aiming to improve the polyp detection rate and avoid missed detections. In this method, a new type of module named P-C3 is proposed and added to the backbone and neck networks. The CFA module is introduced at the bottom of the DarkNet53 backbone network to improve the expressiveness of the model. On this basis, the output features are passed into the coordinate attention (CA) module so that the model pays more attention to polyp features. This study will contribute to the field of colon cancer polyp detection, can greatly reduce the misdiagnosis rate of clinicians in endoscopic diagnosis and treatment, and will benefit physicians in their clinical work.

Figure 1. Overall architecture of the proposed scheme.

Figure 2. Mosaic data enhancement diagram. The red squares in the figure are the bounding boxes of the polyps.

Figure 3. The structure of P-BottleNeck.

Figure 5. The structure of context feature augmentation.

Figure 7. The polyp object detection WCYZ dataset. (a) Polyp image samples. (b) Polyp images with bounding boxes. The red squares in the figure are the bounding boxes of the polyps.

Figure 8. Heat maps of the polyp in the dataset. (a) Original image of polyps. (b) Heat map of the polyp in YOLOv5. (c) Heat map of the polyp in the improved model. (The heat map was generated using grad-cam 1.4.6, URL: https://github.com/jacobgil/pytorch-grad-cam).

Figure 9. A subset of the detection results of polyps showing low contrast to the background in the WCYZ dataset. The red squares in the figure are the bounding boxes of the polyps.

Figure 10. A subset of the detection results of small target polyps in the WCYZ dataset. The red squares in the figure are the bounding boxes of the polyps.

Figure 11. A subset of the detection results of multiple target polyps in the WCYZ dataset. The red squares in the figure are the bounding boxes of the polyps.

Table 1. Results of ablation experiments on the WCYZ dataset. The best result for each metric is shown in bold.

Table 2. Comparison of different methods on the WCYZ dataset. The best result for each metric is shown in bold.